
Draft: etcdctl: backup and restore #2366

Closed

Conversation

xiang90 (Contributor) commented Feb 24, 2015

/cc @jedsmith

freeze.etcd is a 5-member cluster.

# backup with no configuration
./etcdctl backup-noconf freeze.etcd/member/ backup.etcd/member

# restore to a two-member cluster
# restore 1st
./etcdctl restore --backup-dir backup.etcd/member --restore-dir restore.infra1.etcd/member --cluster-configuration "infra1=http://127.0.0.1:7001,infra2=http://127.0.0.1:7002" -name "infra1"

# restore 2nd
./etcdctl restore --backup-dir backup.etcd/member --restore-dir restore.infra2.etcd/member --cluster-configuration "infra1=http://127.0.0.1:7001,infra2=http://127.0.0.1:7002" -name "infra2"

# start 1st
./etcd --data-dir restore.infra1.etcd/ -name infra1 -listen-peer-urls http://127.0.0.1:7001 -listen-client-urls http://127.0.0.1:4001

# start 2nd
./etcd --data-dir restore.infra2.etcd/ -name infra2 -listen-peer-urls http://127.0.0.1:7002 -listen-client-urls http://127.0.0.1:4002

xiang90 (Contributor, Author) commented Feb 24, 2015

/cc @kelseyhightower @philips

The previous force-new command cannot inject a new configuration into the backup. So @jedsmith, @barakmich and I propose a new workflow to restore a cluster directly from a configuration-free backup.

xiang90 added the tools label Feb 24, 2015
kelseyhightower (Contributor) commented:

This workflow seems pretty odd.

Ideally it would flow like this:

  • Spin up a new cluster with no data
  • Import existing data from an existing cluster (etcdctl import < etcd.backup)

xiang90 (Contributor, Author) commented Feb 24, 2015

@kelseyhightower That is a chicken-and-egg problem. Basically, you will not be able to have a cluster without a data dir.

barakmich (Contributor) commented:

@kelseyhightower As a thought experiment: it also wouldn't be much of a restored backup; it'd be a replay.

Writing with a little inspiration from The Part-Time Parliament: the records in the log wouldn't quite be the same, even if they were in the same order (and they may not be, if some other message were to come in between -- but let's assume full operator control). This log:

Entry 1: Default configuration of new cluster
Entry 2: "Entry 34: Write key"
Entry 3: "Entry 35: Write key"

-- while indeed having the same net effect and apparent data, is a way different log than one where we've scrubbed the configuration entries and have, exactly mirroring history-wise, the data we backed up:

Entry 1: NOOP
...
Entry 34: Write key
Entry 35: Write key
...
Entry 200: Configuration of new cluster

By preserving the history, it's much easier to discuss what happened and when. It also means that backups of things restored from backups share a piece of history with the original backup, which is both true and kind of what you want. The former case doesn't have that property, full stop.

return snap, nil
}

func purgeConfInEnts(ents []raftpb.Entry) []raftpb.Entry {
Review comment (Contributor):

This is doing the right thing, but could it make sense to make the entries that we skip into no-ops instead of removing them?

xiang90 (Contributor, Author) replied:

sure.
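
A minimal sketch of that no-op variant, assuming etcd's raftpb types (a hypothetical illustration, not the code in this PR):

import "github.com/coreos/etcd/raft/raftpb"

// noopConfInEnts rewrites configuration-change entries as empty normal
// entries instead of dropping them, so every entry keeps its original
// Term and Index and the log history is preserved.
func noopConfInEnts(ents []raftpb.Entry) []raftpb.Entry {
    out := make([]raftpb.Entry, 0, len(ents))
    for _, e := range ents {
        if e.Type == raftpb.EntryConfChange {
            // Strip the configuration payload but keep the slot.
            e.Type = raftpb.EntryNormal
            e.Data = nil
        }
        out = append(out, e)
    }
    return out
}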

philips (Contributor) commented Feb 24, 2015

@xiang90 I thought we were going to replace the existing etcdctl backup with this, not add a noconf postfix.

xiang90 (Contributor, Author) commented Feb 24, 2015

@philips Correct. We need to kill backup and force-new together if people like this approach.

xiang90 (Contributor, Author) commented Apr 6, 2015

@philips @kelseyhightower @barakmich @yichengq @jedsmith

I want to push this pull request forward a little bit.

So our decision is:

  1. deprecate the --force-new-cluster flag in etcd
  2. introduce a backup command into etcdctl, which only backs up the data (no configuration), offline
  3. introduce a restore command into etcdctl, which reconstructs a data-dir from the given data and configuration

Sound good?
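
For concreteness, a sketch of the flag surface such a restore command might expose, using the codegangsta/cli package etcdctl is built on (names taken from the workflow above; nothing here is final):

import "github.com/codegangsta/cli"

// NewRestoreCommand sketches the proposed `etcdctl restore` subcommand.
func NewRestoreCommand() cli.Command {
    return cli.Command{
        Name:  "restore",
        Usage: "reconstruct a data-dir from a configuration-free backup and a new cluster configuration",
        Flags: []cli.Flag{
            cli.StringFlag{Name: "backup-dir", Usage: "directory containing the backed-up member data"},
            cli.StringFlag{Name: "restore-dir", Usage: "directory to write the reconstructed data-dir into"},
            cli.StringFlag{Name: "cluster-configuration", Usage: "comma-separated name=peerURL pairs for the new cluster"},
            cli.StringFlag{Name: "name", Usage: "name of this member within cluster-configuration"},
        },
        Action: func(c *cli.Context) {
            // reconstruction logic elided in this sketch
        },
    }
}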

philips (Contributor) commented Apr 6, 2015

On Mon, Apr 6, 2015 at 2:50 PM, Xiang Li wrote:

introduce a backup command into etcdctl, which only backs up the data (no configuration), offline

How does this change the existing backup command? The current command keeps configuration?

introduce a restore command into etcdctl, which reconstructs a data-dir from the given data and configuration

This means that etcdctl would have to take all of the configuration flags that etcd can? I think that having a subcommand on etcd like etcd init or something would be preferable, since etcd knows how to deal with all of those flags, etc., already.

xiang90 (Contributor, Author) commented Apr 6, 2015

How does this change the existing backup command? The current command keeps configuration?

When we deprecate the --force-new-cluster flag in etcd, the previous backup command in etcdctl makes no sense.

I think that having a subcommand on etcd like etcd init or something would be preferable, since etcd knows how to deal with all of those flags, etc., already.

Sure. I like that more. Then we need to rethink the etcd init process and that will lead to more changes.

yichengq (Contributor) commented Apr 7, 2015

This means that etcdctl would have to take all of the configuration flags that etcd can? I think that having a subcommand on etcd like etcd init or something would be preferable, since etcd knows how to deal with all of those flags, etc., already.

@philips @xiang90 I think when we restore from a backup, we just need to support static bootstrap for 100% correctness. It only needs to take the --initial-cluster and --name flags from etcd. This level of duplication is acceptable.
On the other hand, I don't want to reinvent the wheel of the bootstrap process using etcd init without stronger motivation.
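
A small sketch of the only parsing that duplication would entail (a hypothetical helper, not etcd's actual code): turning an --initial-cluster string into a name-to-peer-URL map, from which the restoring member's own peer URL is looked up by --name:

import (
    "fmt"
    "strings"
)

// parseInitialCluster splits "name1=peerURL1,name2=peerURL2" into a map.
func parseInitialCluster(s string) (map[string]string, error) {
    cluster := make(map[string]string)
    for _, pair := range strings.Split(s, ",") {
        kv := strings.SplitN(pair, "=", 2)
        if len(kv) != 2 {
            return nil, fmt.Errorf("invalid member %q", pair)
        }
        cluster[kv[0]] = kv[1]
    }
    return cluster, nil
}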

philips (Contributor) commented Apr 7, 2015

My concern is that adding WAL and configuration logic to etcdctl takes us down a road where etcdctl needs to know more about the internal details of etcd than it has before. With the exception of the etcdctl backup command, everything etcdctl has done has been identical to what is possible from the "userspace API".

Any thoughts on this @barakmich @kelseyhightower ?

xiang90 (Contributor, Author) commented Apr 16, 2015

@barakmich @kelseyhightower Ping

Winslett commented Oct 1, 2015

👍 - I've been orchestrating the recovery process and this sounds like a much better approach.

colhom (Contributor) commented Nov 17, 2015

I've got a PR up which automates the backup and restore procedures for etcd2 clusters: #3882

As this procedure is a last-line-of-defense type of thing, I've opted for a simple/robust approach.

  • Pick an individual node's backup, and restore a single-node cluster from that backup using --force-new-cluster
  • Manually set the advertised peer URL for this node
  • All other nodes join the cluster with no data directory and simply catch up.

This requires the operator to pick the most recent backup, but removes any risk of the restore failing due to disagreements between nodes on the ordering of events.

Here's my take on what should be improved:

  1. etcdctl backup should wipe all configuration data.

  2. etcd should be able to detect, on startup, if its data directory is "fresh from backup" (devoid of cluster configuration data); see the sketch after this list.

  3. if etcd detects "fresh from backup", it will process the ETCD_INITIAL_* environment variables.

    WRT cluster configuration, the end result should be identical to what happens now if the <data-dir>/member folder does not exist on startup.
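
A sketch of what the detection in item 2 could look like, assuming the entries have already been read from the WAL (hypothetical; etcd does not do this today):

import "github.com/coreos/etcd/raft/raftpb"

// freshFromBackup reports whether a log is configuration-free, i.e. it
// contains no EntryConfChange records. If true on startup, etcd would
// process the ETCD_INITIAL_* variables exactly as it does for a missing
// <data-dir>/member folder.
func freshFromBackup(ents []raftpb.Entry) bool {
    for _, e := range ents {
        if e.Type == raftpb.EntryConfChange {
            return false
        }
    }
    return true
}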

This means that the operator can restore a failed cluster by:

  1. Stopping the etcd2 service on all nodes
  2. Clearing all etcd2 data directories
  3. Copying the latest backup to any single node's data directory. This node is called the restore node.
  4. Starting the etcd2 service on all nodes

The bootstrapping should look very similar to how things work now when none of the nodes has a data directory. The only difference is that the restore node needs to be the founding member of the cluster.

  • etcd2 service lifecycle during restore is the same for all machines --> can batch commands to all nodes of the cluster.
  • There's no starting of intermediary etcd2 processes with special arguments
  • No more incrementally adding another member and starting another etcd2 node.
  • Not having to dynamically modify the service parameters for etcd2 in order to restore is also really key here.

As the desired cluster topology is already expressed in the ETCD_INITIAL_* environment variables present in the systemd unit, it's no bueno to have the operator manually re-express that topology incrementally via "member add/update" commands in order to restore.

/cc @xiang90 @philips

colhom (Contributor) commented Nov 18, 2015

@xiang90 and I had a conversation this morning about the backup situation for v3.

Here are my notes on the discussion.

  • etcdctl backup will:
    • strip all cluster configuration data
    • regenerate a new clusterID

Restoring a new cluster from backup

The member/ folder generated by etcdctl backup will be distributed to each node.

A new command, provisionally named etcdctl set-config, will operate on the member/ folder output by etcdctl backup on each machine. It will take two parameters:

  • cluster-config: same format as -initial-cluster, a comma-separated list of <name>=<peer_url>. This argument is the same for all nodes in the new cluster.
  • name: the name of this etcd node. The peer URL for this node can be determined by looking up name in cluster-config.

After etcdctl set-config <params> completes successfully, the member/ folder for that node is ready to be copied/moved to etcd's data-dir. After this is complete for all nodes, we can spool up etcd2 on each and expect a functioning cluster restored from backup.
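
A sketch of the core of that provisional set-config step (hypothetical; in real etcd the ConfChange Context carries a JSON-encoded Member rather than a bare peer URL):

import "github.com/coreos/etcd/raft/raftpb"

// appendClusterConf appends one configuration entry per member after the
// last entry of a configuration-free backup log. Member IDs are assumed
// to be computed elsewhere (e.g. hashed from name and peer URL).
func appendClusterConf(ents []raftpb.Entry, cluster map[string]string, ids map[string]uint64) ([]raftpb.Entry, error) {
    next, term := uint64(1), uint64(0)
    if n := len(ents); n > 0 {
        next, term = ents[n-1].Index+1, ents[n-1].Term
    }
    for name, peerURL := range cluster {
        cc := raftpb.ConfChange{
            ID:      next,
            Type:    raftpb.ConfChangeAddNode,
            NodeID:  ids[name],
            Context: []byte(peerURL), // simplified; see note above
        }
        data, err := cc.Marshal()
        if err != nil {
            return nil, err
        }
        ents = append(ents, raftpb.Entry{
            Type:  raftpb.EntryConfChange,
            Term:  term,
            Index: next,
            Data:  data,
        })
        next++
    }
    return ents, nil
}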

xiang90 (Contributor, Author) commented Nov 18, 2015

@colhom Great summary! Thanks.
